fix: filter non-productive TCRs, dynamic gene family columns, single-…#82
Open
KevinMLanderos wants to merge 1 commit into
…cell quality filters
Three bugs were fixed:
1. bin/sample_calc.py — Gene family CSVs (v_family, d_family, j_family) were
built with hard-coded maximum indices (TRBV: 30, TRBD: 2, TRBJ: 2), silently
dropping any gene with a number above those limits. The max index is now
derived dynamically from genes observed in each sample. Samples with no valid
calls for a gene type write a sample-only row instead of crashing.
2. modules/local/annotate/main.nf — ANNOTATE_PROCESS did not filter for
productive TCRs, so non-productive rearrangements propagated into every
downstream file: concatenated_cdr3_sorted.tsv, OLGA pgen calculation,
TCRSHARING, and the patient workflow (GIANA, GLIPH2, overlap metrics).
The process now reads the 'productive' column when present and retains only
productive entries before writing per-sample _cdr3.tsv files, fixing all
downstream analyses in one place.
3. bin/pseudobulk.py — Cell Ranger AIRR output was not filtered for cell or
contig quality before pseudobulking. is_cell, high_confidence, and productive
filters are now applied in both pseudobulk() and pseudobulk_phenotype() when
those columns are present, ensuring background barcodes, low-confidence
assemblies, and non-productive contigs are excluded from single-cell input.
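
As a rough illustration of fixes 1 and 3 (and the same column-conditional pattern used in fix 2), here is a minimal stdlib-only sketch. It is not the code in this PR: the helper names (`is_true`, `filter_rows`, `family_counts`), the `sample` key, the output column naming, and the set of accepted truthy strings are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumption: boolean flags may arrive as "T"/"F", "true"/"false", or real bools.
TRUE_VALUES = {"t", "true", "1"}


def is_true(value):
    """Interpret a TSV cell as a boolean flag."""
    return str(value).strip().lower() in TRUE_VALUES


def filter_rows(rows, flags=("is_cell", "high_confidence", "productive")):
    """Keep rows where every flag column that is present is true.

    Flag columns missing from the input are skipped, so files without
    them pass through unchanged.
    """
    rows = list(rows)
    if not rows:
        return rows
    present = [f for f in flags if f in rows[0]]
    return [r for r in rows if all(is_true(r[f]) for f in present)]


# Matches the locus prefix and the family number, e.g. "TRBV5-1*01" -> ("TRBV", "5").
FAMILY_RE = re.compile(r"^(TRB[VDJ])(\d+)")


def family_counts(sample, calls, prefix):
    """Count calls per gene family, with columns running 1..max observed.

    The upper bound comes from the data rather than a hard-coded limit.
    With no valid calls, a sample-only row is returned.
    """
    counts = Counter()
    for call in calls:
        m = FAMILY_RE.match(call or "")
        if m and m.group(1) == prefix:
            counts[int(m.group(2))] += 1
    row = {"sample": sample}
    if not counts:
        return row
    for i in range(1, max(counts) + 1):
        row[f"{prefix}{i}"] = counts.get(i, 0)
    return row
```

With this sketch, `family_counts("s1", ["TRBV31*01"], "TRBV")` yields columns up to `TRBV31` instead of stopping at a fixed index, and `filter_rows` applied to rows that carry none of the three flag columns returns them unchanged.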
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Unit Test Results: 10 tests, 2 ✅, 21s ⏱️. For more details on these failures, see this check. Results for commit 389a328.